4 research outputs found

    PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability

    We propose a novel computing runtime that exposes remote compute devices via the cross-vendor open heterogeneous computing standard OpenCL and can execute compute tasks on the MEC cluster side across multiple servers in a scalable manner. Intermittent UE connection loss is handled gracefully, even if the device's IP address changes in the process. Network-induced latency is minimized by transferring data and signaling command completions between remote devices in a peer-to-peer fashion directly to the target server with a streamlined TCP-based protocol, which yields a command latency of only 60 microseconds on top of the network round-trip latency in synthetic benchmarks. The runtime can utilize RDMA to speed up inter-server data transfers by an additional 60% compared to the TCP-based solution. The benefits of the proposed runtime in MEC applications are demonstrated with a smartphone-based augmented reality rendering case study. Measurements show up to 19x improvement in frame rate and 17x improvement in local energy consumption when using the proposed runtime to offload AR rendering from a smartphone. Scalability to multiple GPU servers in real-world applications is shown with a computational fluid dynamics simulation, which scales with the number of servers at roughly 80% efficiency, comparable to an MPI port of the same simulation. Comment: 13 pages, 17 figures
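
    Because the runtime implements the standard OpenCL API, a client application needs no runtime-specific code to offload work; remote MEC devices simply appear as additional OpenCL devices. The minimal host-side sketch below illustrates this; the kernel source and the choice of the first available device are illustrative assumptions, not taken from the paper.

        /* Minimal OpenCL host sketch: with PoCL-R the selected device may be a
         * remote MEC server, with no changes to this standard API usage.
         * Kernel source and device selection are assumptions for illustration. */
        #include <CL/cl.h>
        #include <stdio.h>

        static const char *src =
            "__kernel void scale(__global float *buf, float f) {"
            "    size_t i = get_global_id(0); buf[i] *= f; }";

        int main(void) {
            cl_platform_id plat; cl_device_id dev;
            clGetPlatformIDs(1, &plat, NULL);
            clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL); /* possibly remote */

            cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
            cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

            float data[1024] = {1.0f};
            cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                        sizeof(data), data, NULL);

            cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
            clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
            cl_kernel k = clCreateKernel(prog, "scale", NULL);
            float factor = 2.0f;
            clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
            clSetKernelArg(k, 1, sizeof(float), &factor);

            size_t gsize = 1024;
            clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
            printf("data[0] = %f\n", data[0]);

            clReleaseKernel(k); clReleaseProgram(prog); clReleaseMemObject(buf);
            clReleaseCommandQueue(q); clReleaseContext(ctx);
            return 0;
        }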

    PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime

    Offloading the most demanding parts of applications to an edge GPU server cluster to save power or improve the result quality is a solution that becomes increasingly realistic with new networking technologies. In order to make such a computing scheme feasible, an application programming layer is needed that can provide both low latency and scalable utilization of remote heterogeneous computing resources. To this end, we propose a latency-optimized scalable distributed heterogeneous computing runtime implementing the standard OpenCL API. In the proposed runtime, network-induced latency is reduced by means of peer-to-peer data transfers and event synchronization as well as a streamlined control protocol implementation. Further improvements can be obtained by streaming source data directly from the producer device to the compute cluster. Compute cluster scalability is improved by distributing the command and event processing responsibilities to the remote compute servers. We also show how a simple optional dynamic content size buffer OpenCL extension can significantly speed up applications that utilize variable-length data. For evaluation, we present a smartphone-based augmented reality rendering case study which, using the runtime, achieves a 19× improvement in frames per second and a 17× improvement in energy per frame when offloading parts of the rendering workload to a nearby GPU server. The remote kernel execution latency overhead of the runtime is only 60 microseconds on top of the network round-trip time. The scalability on multi-server multi-GPU clusters is shown with a distributed large matrix multiplication application.
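
    The latency-reduction mechanisms described above surface to applications only as ordinary OpenCL command queues and events. The sketch below shows that pattern in its simplest form: two queues on two devices of one context chained by an event, so that a distributed runtime such as PoCL-R can migrate the shared buffer and signal completion peer-to-peer between servers rather than routing through the client. The kernel names and the helper signature are assumptions for illustration.

        /* Sketch: chain work across two devices of one context with an event.
         * With a distributed runtime such as PoCL-R the two devices can live
         * on different servers; the buffer migration and the completion signal
         * for `ev` can then travel peer-to-peer between the servers, not via
         * the host.  Kernel names "produce"/"consume" are assumptions. */
        #include <CL/cl.h>

        void chain_two_devices(cl_context ctx, cl_device_id dev_a, cl_device_id dev_b,
                               cl_program prog, cl_mem buf, size_t n)
        {
            cl_command_queue qa = clCreateCommandQueueWithProperties(ctx, dev_a, NULL, NULL);
            cl_command_queue qb = clCreateCommandQueueWithProperties(ctx, dev_b, NULL, NULL);

            cl_kernel produce = clCreateKernel(prog, "produce", NULL);
            cl_kernel consume = clCreateKernel(prog, "consume", NULL);
            clSetKernelArg(produce, 0, sizeof(cl_mem), &buf);
            clSetKernelArg(consume, 0, sizeof(cl_mem), &buf);

            cl_event ev;
            /* Stage 1 on device A; `ev` marks its completion. */
            clEnqueueNDRangeKernel(qa, produce, 1, NULL, &n, NULL, 0, NULL, &ev);
            /* Stage 2 on device B waits on `ev`; the runtime moves `buf` as needed. */
            clEnqueueNDRangeKernel(qb, consume, 1, NULL, &n, NULL, 1, &ev, NULL);
            clFinish(qb);

            clReleaseEvent(ev);
            clReleaseKernel(produce); clReleaseKernel(consume);
            clReleaseCommandQueue(qa); clReleaseCommandQueue(qb);
        }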

    Distributed Low Latency Computing With OpenCL: A Scalable Multi-Access Edge Computing Framework

    The ever-increasing computational complexity of applications requires increasing amounts of processing power, yet users are increasingly moving to resource- and power-constrained mobile devices for their computational needs. This calls for creative solutions that provide increased processing capabilities without impacting battery life or degrading the user experience. Multi-Access Edge Computing is a standardization effort to provide consistent cloud edge environments for optimizing applications on low-power devices by enabling developers to offload parts of an application to networked computing infrastructure located physically close to the device running the application. This master's thesis describes PoCL-R, a framework for transparently offloading computation in applications that use the OpenCL API for heterogeneous computation. The implementation performs comparably to previous work in synthetic benchmarks while offering greater flexibility to application developers by not depending on third-party communication frameworks and not requiring the application to be aware of any particular OpenCL API extensions. In addition to synthetic benchmarks, the impact of offloading heavy computation is measured in a case study of a mobile application that renders a streamed animated point cloud. The energy consumption when offloading was measured to be roughly half of what it was without offloading. When the application is additionally made aware of a minimal extension to the OpenCL API, energy consumption per frame was cut to roughly a twentieth of the original while also increasing the frame rate tenfold.
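
    The "minimal extension" mentioned above lets a kernel report how many bytes of an output buffer are actually valid, so only that prefix needs to cross the network. The sketch below shows how an application might opt into such an extension while remaining portable; the entry point name clSetContentSizeBufferPOCL and the size type are assumptions based on PoCL's content-size extension and should be checked against the installed headers.

        /* Sketch of a dynamic content size buffer: a small side buffer tells the
         * runtime how many bytes of `data` a kernel actually produced, so only
         * that prefix needs to be transferred.  The extension entry point and
         * the size type are assumptions; the fallback path uses core OpenCL only. */
        #include <CL/cl.h>

        typedef cl_int (*set_content_size_fn)(cl_mem buffer, cl_mem content_size_buffer);

        cl_mem create_variable_length_buffer(cl_platform_id plat, cl_context ctx,
                                             size_t capacity, cl_mem *size_buf_out)
        {
            cl_mem data = clCreateBuffer(ctx, CL_MEM_READ_WRITE, capacity, NULL, NULL);
            /* Holds the number of valid bytes the kernel wrote into `data`. */
            cl_mem size_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                             sizeof(cl_ulong), NULL, NULL);

            set_content_size_fn set_size = (set_content_size_fn)
                clGetExtensionFunctionAddressForPlatform(plat, "clSetContentSizeBufferPOCL");
            if (set_size)            /* extension available: link the two buffers */
                set_size(data, size_buf);
            /* Without the extension the application still works; the runtime just
             * transfers the full `capacity` bytes instead of the valid prefix. */
            *size_buf_out = size_buf;
            return data;
        }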

    Efficient reconfigurable CPS for monitoring the elderly at home via Deep Learning

    The rapid growth of the aging population poses serious challenges for our society, e.g., an increase in health care costs. Fueled by information and communication technologies (ICT), Smart-Health Cyber-Physical Systems (CPS) offer innovative solutions to ease this burden. This work proposes a general framework for CPS-based solutions that adapt at run-time to optimize the overall performance of the system, continuously monitoring working qualities such as bandwidth utilization, response time, accuracy, or energy consumption. Adaptation is done through automatic reconfiguration of the deep learning models deployed on local edge nodes. Local processing is performed on embedded devices that provide short latency and real-time processing despite their limited computation capacity compared to high-end cloud servers. Furthermore, a virtualization platform allows the system to offload occasional intensive computation to such shared high-end servers. This paper demonstrates the reconfigurable CPS in a challenging scenario: indoor ambient assisted living for the elderly. Our system collects lifestyle user data in a non-invasive manner to promote healthy habits and triggers alarms in case of emergency to foster autonomy. Local edge video processing nodes identify indoor activities using state-of-the-art deep-learning action recognition models. The optimized embedded nodes allow the system to reduce cost and power consumption locally, but the system needs to maximize overall performance in a changing environment. To that end, our solution enables run-time reconfiguration to adapt in terms of functionality or resource availability. The experimental section shows a real-world setup performing run-time adaptation with different reconfiguration policies. In that example, run-time adaptation extends the working time by more than 60% or increases critical action recognition accuracy by up to 3x.
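
    The run-time adaptation described above amounts to a policy that maps monitored qualities to a deployment decision. The sketch below illustrates that idea in its simplest form; the metrics, thresholds, and configuration names are illustrative assumptions and do not reproduce the paper's actual reconfiguration policies.

        /* Minimal sketch of a run-time reconfiguration policy of the kind the
         * abstract describes: map monitored qualities to a deployment choice.
         * Metrics, thresholds, and configuration names are assumptions. */
        #include <stdio.h>

        typedef struct {
            double response_time_ms;   /* end-to-end latency of last inference  */
            double bandwidth_mbps;     /* currently available uplink bandwidth  */
            double battery_pct;        /* remaining battery on the edge node    */
        } qualities_t;

        typedef enum {
            LOCAL_LIGHT_MODEL,   /* small action-recognition model on the edge  */
            LOCAL_FULL_MODEL,    /* full-accuracy model on the edge             */
            OFFLOAD_TO_SERVER    /* run the full model on a shared server       */
        } deployment_t;

        deployment_t choose_deployment(qualities_t q)
        {
            /* Prefer offloading when the network can sustain it. */
            if (q.bandwidth_mbps > 50.0 && q.response_time_ms < 200.0)
                return OFFLOAD_TO_SERVER;
            /* Fall back to the lightweight model when battery is critical. */
            if (q.battery_pct < 20.0)
                return LOCAL_LIGHT_MODEL;
            return LOCAL_FULL_MODEL;
        }

        int main(void)
        {
            qualities_t q = { 250.0, 10.0, 35.0 };   /* example measurements */
            printf("selected deployment: %d\n", choose_deployment(q));
            return 0;
        }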